
lance-graph-contract: C++ codegen target (MethodSig) + UniCharSet content store#521

Merged
AdaWorldAPI merged 9 commits into main from claude/happy-hamilton-0azlw4
Jun 17, 2026

@AdaWorldAPI AdaWorldAPI commented Jun 17, 2026

The Core-side of the Tesseract C++→Rust transcode — the types ruff's ruff_cpp_codegen targets, plus the byte-parity probe's Rust side. All additive to lance-graph-contract: zero NodeRow/ValueTenant/ValueSchema/stride/ENVELOPE_LAYOUT_VERSION impact (container-architect ADDITIVE-CONFIRMED).

codegen_manifest: MethodSig is the &'static-backed, const-constructible method-signature type the generated text names (the method-axis sibling of ClassView's field projection; FieldRef is String-backed and can't appear in a const, which is exactly why this is a new type). ClassMethods + methods_for: the registry entry + zero-fallback lookup; classid is bound OGAR-side, the data is generated downstream (no runtime registry stored here).
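A minimal sketch of why the &'static shape matters, assuming the field list given in the commit message (name, params, ret, is_const, is_static, overrides). The `Option<&'static str>` type for `overrides` and the sample constant are assumptions made for illustration, not the crate's actual definition.

```rust
/// Hypothetical mirror of the MethodSig shape described above; the real
/// definition lives in lance-graph-contract and may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MethodSig {
    pub name: &'static str,
    pub params: &'static [&'static str],
    pub ret: &'static str,
    pub is_const: bool,
    pub is_static: bool,
    /// Assumed here to name the overridden base method, if any.
    pub overrides: Option<&'static str>,
}

impl MethodSig {
    /// Number of declared parameters.
    pub const fn arity(&self) -> usize {
        self.params.len()
    }

    /// True when the method overrides a base-class method.
    pub const fn is_override(&self) -> bool {
        self.overrides.is_some()
    }
}

/// Every field is a &'static reference or a Copy scalar, so the whole value
/// can be written as a `const`. A String-backed field could not be built
/// here, because constructing a non-empty String needs a heap allocation.
pub const ID_TO_UNICHAR: MethodSig = MethodSig {
    name: "id_to_unichar",
    params: &["UNICHAR_ID"],
    ret: "const char*",
    is_const: true,
    is_static: false,
    overrides: None,
};
```

Because `arity()` is a `const fn` in this sketch, generated code could also evaluate it at compile time, for example `const N: usize = ID_TO_UNICHAR.arity();`.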

unicharset: UniCharSet (deepnsm::Vocabulary-shaped: reverse id→unichar + lookup unichar→id), .unicharset parser + id_to_unichar/unichar_to_id + dump(). The Rust side of PROBE-OGAR-ADAPTER-UNICHARSET: pure text parsing, zero leptonica (the unicharset path never touches Pix), so it builds + unit-tests with no C deps. The unicharset_dump example renders the oracle-shape table so byte-parity is a single diff.

Board hygiene in-commit: LATEST_STATE Contract Inventory (D-CPP-CODEGEN-1, D-UNICHARSET-1). Plan transcode-extend-core-probe-v1.md carries the full 5-consolidate + 3-brutal council record and the C-FIRST D → emitter → EXTEND-CORE arc. 644 contract lib tests green; clippy -D warnings + fmt clean.

Pairs with the ruff PR (the harvester/codegen that produces what these types consume).

🤖 Generated with Claude Code

https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1



Summary by CodeRabbit

  • New Features

    • Added support for loading and querying Tesseract character set files with bidirectional mapping lookups
    • Added method signature registry with lookup functionality
    • Added CLI tool for dumping and inspecting character set data
  • Documentation

    • Added detailed planning documentation for infrastructure extensions

coderabbitai Bot commented Jun 17, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 79aacaa3-c1a3-4599-80f5-fff93a0bc854

📥 Commits

Reviewing files that changed from the base of the PR and between e82f202 and dce9961.

📒 Files selected for processing (8)
  • .claude/board/EPIPHANIES.md
  • .claude/board/LATEST_STATE.md
  • .claude/knowledge/core-first-transcode-doctrine.md
  • .claude/plans/transcode-extend-core-probe-v1.md
  • crates/lance-graph-contract/examples/unicharset_dump.rs
  • crates/lance-graph-contract/src/codegen_manifest.rs
  • crates/lance-graph-contract/src/lib.rs
  • crates/lance-graph-contract/src/unicharset.rs
📝 Walkthrough


Two new public modules are added to the lance-graph-contract crate: unicharset, implementing a Tesseract .unicharset file parser with bidirectional id↔unichar mappings and a dump() CLI example, and codegen_manifest, defining const-constructible C++ method-dispatch signature types and a registry lookup. Planning and board-state documentation tracking these additions is also included.

Changes

lance-graph-contract new modules and planning docs

| Layer | File(s) | Summary |
| --- | --- | --- |
| UniCharSet content store module | crates/lance-graph-contract/src/lib.rs, crates/lance-graph-contract/src/unicharset.rs, crates/lance-graph-contract/examples/unicharset_dump.rs | Adds pub mod unicharset to the crate root. Defines UniCharSet with a reverse: Vec<String> and lookup: HashMap<String, u32>, UniCharSetError enum with Display/Error impls, load_from_str/load_from_file parsers that enforce a declared entry count, and public size, id_to_unichar, unichar_to_id, and dump() methods. Includes the unicharset_dump CLI example and unit tests covering parsing rules, bijection round-trips, exact dump formatting, and all error variants. |
| codegen_manifest types and registry lookup | crates/lance-graph-contract/src/lib.rs, crates/lance-graph-contract/src/codegen_manifest.rs | Adds pub mod codegen_manifest to the crate root. Defines MethodSig as a const-constructible struct with &'static fields and arity()/is_override() helpers, ClassMethods associating a classid with a generated &'static [MethodSig] slice, and methods_for() which returns the method slice for a registered classid or an empty slice for unknown ones. Includes unit tests for const construction and registry resolution. |
| Board state and EXTEND-CORE planning docs | .claude/board/LATEST_STATE.md, .claude/plans/transcode-extend-core-probe-v1.md | Adds two contract inventory entries to LATEST_STATE.md. Introduces transcode-extend-core-probe-v1.md, a 460-line planning document recording the full EXTEND-CORE proposal, C-FIRST sequencing decision, brutal-panel critiques, and landed milestones for D, the ruff_cpp_codegen emitter scaffold, codegen_manifest, and the unicharset content store, with the remaining byte-parity probe described as operator-gated. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐇 Hop, hop—two new modules spring to life today,
UniCharSet maps each glyph from id to display,
MethodSig stands const, a signature so neat,
The registry returns an empty slice—no defeat!
With dump and round-trip tests all gleaming bright,
This rabbit stamps the contract: compile-time right! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main changes: adding MethodSig codegen target and UniCharSet content store to lance-graph-contract. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e82f202cde


Comment thread crates/lance-graph-contract/src/unicharset.rs Outdated
@AdaWorldAPI AdaWorldAPI force-pushed the claude/happy-hamilton-0azlw4 branch from 21f325f to fbb7b11 Compare June 17, 2026 20:18
claude added 9 commits June 17, 2026 20:49
Records the full 5-consolidate + 3-brutal-critique council on the
Tesseract C++->Rust transcode next-move decision.

5 consolidation agents: core-first-architect (TARGETS-CORE, found the
ocr.rs classid-keyed-registry precedent), container-architect
(ADDITIVE-CONFIRMED, zero locked-node impact), adapter-shaper
(THIN-CONFIRMED, scoped to old_style_included_), truth-architect
(PREMATURE), integration-lead (SEQUENCE C->A->D->B).

3 brutal critics converge: the original "self-authored golden + run-twice
determinism" gate is a tautology (truth-architect, brutally-honest-tester,
adk-behavior-monitor all independently). Replacement gate, named by all
three: a round-trip structural-equivalence falsifier
(expand -> ndjson -> from_ndjson -> reassemble, assert ~= original) that
is immune to the harvest-freshness drift (live 2032 triples vs committed
880) and CAN fail (real UNICHARMAP vs UNICHARSET unichar_to_id collision).
baton-handoff-auditor: CATCH-LATENT, no CATCH-CRITICAL; bakes in the
design constraint that reassembly derives per-overload identity from the
index-prefixed has_param_type triples, never the (params) IRI suffix
(no clean inverse for comma-bearing templated types).
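The reassembly constraint above can be illustrated with a small sketch. It assumes a hypothetical encoding in which each has_param_type object is written as "<index>:<type>" and a method IRI ends in a "(...)" parameter suffix; neither format is confirmed by this PR, and both function names are invented for the example.

```rust
/// Naive inverse: split the "(...)" suffix of a method IRI on commas.
/// Shown only to illustrate why this is not a clean inverse.
pub fn params_from_suffix(iri: &str) -> Vec<String> {
    let open = match iri.find('(') {
        Some(i) => i,
        None => return Vec::new(),
    };
    let rest = &iri[open + 1..];
    // Drop everything from the last ')' onward, if there is one.
    let inner = rest.rfind(')').map_or(rest, |close| &rest[..close]);
    if inner.trim().is_empty() {
        return Vec::new();
    }
    inner.split(',').map(|p| p.trim().to_string()).collect()
}

/// Reassembly from index-prefixed triples: each object carries its own
/// position, so commas inside a templated type are never interpreted.
/// The "<index>:<type>" object format is an assumption of this sketch.
pub fn params_from_triples(objects: &[&str]) -> Vec<String> {
    let mut indexed: Vec<(usize, String)> = objects
        .iter()
        .filter_map(|obj| {
            // Split at the first ':' only; "::" inside the type is untouched.
            let (idx, ty) = obj.split_once(':')?;
            Some((idx.parse::<usize>().ok()?, ty.to_string()))
        })
        .collect();
    indexed.sort_by_key(|(idx, _)| *idx);
    indexed.into_iter().map(|(_, ty)| ty).collect()
}
```

For a two-parameter method whose first parameter is `std::map<int, std::string>`, the suffix split yields three fragments, while the indexed triples recover the two original types in order regardless of the order in which the triples arrive.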

Final 8/8 decision: execute re-scoped C-FIRST. OCR-SCHEMA mis-cite dropped;
Frankenstein-refusal becomes an honest deny-list test; PARITY: UNRUN
honesty markers required; byte-parity promotion stays operator-gated.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…er green

The re-scoped C-FIRST gate is built in ruff: reassemble() (generator stage 1,
inverse of expand) + the round-trip falsifier that replaced the council-
rejected self-golden. CPP-REASSEMBLE-RT runs green on real Tesseract ccutil
(67 classes; class-set preservation + idempotence). The falsifier found a real
bug class — 19/67 const-overload IRI collisions (GAP-CONST-OVERLOAD, queued).
truth-architect's PREMATURE flag resolved: real measurement on real data.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
Records the step-2 emitter council. Major reframe (3/5 agents independently):
the plan's premise was false — ClassView is a field/render vocabulary with NO
method-resolution surface (has_function does not appear in lance-graph-contract).

5-agent consolidation: manifest-first cut (not stubs/bodies), additive
(container ADDITIVE-CONFIRMED), placement = a new ruff_cpp_codegen crate that
emits TEXT only (no ruff->lance-graph compile edge), D (const-overload fix)
before the emitter, PARITY markers + Frankenstein deny-list.

Unresolved fork handed to the 3-brutal panel: (A) mint a minimal MethodSig POD
+ classid_methods LazyLock registry (method-axis sibling of classid_read_mode,
EXTEND-CORE but additive) vs (B) ride the SHIPPED codegen_spine::TripletProjection
+ roundtrip_eq (no new type; warns MethodSig re-implements CppMethod and that
re-emitting the harvest is a tautology). codegen_spine.rs verified real (632 lines,
TripletProjection trait + roundtrip_eq fn + Genericity enum).

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…are IRI)

The brutal panel resolved the A-vs-B fork as a false binary (honest-tester):
B's codegen_spine::roundtrip_eq pattern is the build-time GATE, A's MethodSig
shape is the emitted-text target, A's runtime registry is a deferred additive
EXTEND-CORE. baton-auditor's decisive correction (confirmed by honest-tester):
the const-overload merge is UPSTREAM in expand's (s,p,o) dedup, so the round-trip
cannot observe it — it is a fixed point. Therefore D (cv-aware method IRI) must
run FIRST, unanimously (behavior-monitor's emitter-first rationale refuted).
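A sketch of the fixed-point argument, under stated assumptions: the `Method` record, the IRI string format, and the " const" marker are hypothetical stand-ins, not ruff's actual encoding. Only the has_function predicate name is taken from the commit text.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified method record for this sketch.
pub struct Method {
    pub class: &'static str,
    pub name: &'static str,
    pub params: &'static [&'static str],
    pub is_const: bool,
}

/// IRI without cv-qualification: `at(int)` and `at(int) const` collide.
pub fn plain_iri(m: &Method) -> String {
    format!("{}::{}({})", m.class, m.name, m.params.join(","))
}

/// cv-aware IRI: a const marker keeps the two overloads distinct.
/// The " const" suffix is an assumed format.
pub fn cv_aware_iri(m: &Method) -> String {
    let marker = if m.is_const { " const" } else { "" };
    format!("{}{}", plain_iri(m), marker)
}

/// Set-style (s, p, o) dedup, standing in for the dedup inside expand.
pub fn expand(
    methods: &[Method],
    iri: fn(&Method) -> String,
) -> BTreeSet<(String, String, String)> {
    methods
        .iter()
        .map(|m| (m.class.to_string(), "has_function".to_string(), iri(m)))
        .collect()
}
```

With `plain_iri`, a const and a non-const overload collapse to one triple before any round trip starts, so reassembling and expanding again reproduces that same single triple and the check passes. With `cv_aware_iri`, both overloads survive expand, which is why the IRI fix has to land before the round-trip gate can say anything about const overloads.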

Final order: D (cv-aware IRI, autonomous correctness fix — adds no predicate,
falsifier CPP-REASSEMBLE-RT 48/67 -> 67/67) -> emitter scaffold (ruff_cpp_codegen,
emit-text-only, ruff-side round-trip gate, no classid mint) -> MethodSig
EXTEND-CORE in lance-graph (additive) -> wire/byte-parity (operator).

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…nst assumption

Records D's outcome: the cv-aware method IRI + two round-trip-metric fixes
(cpp_projection dedup; is_const in the methods sort key) closed the entire
collision tail (48/67 -> 67/67, now a hard gate). The falsifier overturned the
council's "19/67 = const overloads" inference: only 3 were const; 13 were benign
duplicate template_instantiates, 2 duplicate-harvested methods, 1 a sort-order
artifact. GAP-CONST-OVERLOAD resolved. Next: emitter scaffold (ruff_cpp_codegen)
then the MethodSig EXTEND-CORE.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…real corpus

The C-FIRST step-2 emitter shipped in ruff: project(ModelGraph)->MethodSig
manifest + render to lance-graph-naming Rust text (emit-text-only, no
lance-graph edge), gated by a decompile==expand signature-plane round-trip with
teeth (dropped-method test proves it fails). CPP-CODEGEN-RT on real ccutil:
67 classes, 857 methods -> 124 KB MethodSig manifest, round-trip holds, PARITY
markers + Frankenstein deny-list present. Next (operator-gated, additive canon
growth): the MethodSig EXTEND-CORE in lance-graph-contract.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…ompile target

The MethodSig EXTEND-CORE (C-FIRST step 2's deferred-runtime-registry piece).
A new codegen_manifest module:

- MethodSig: the dispatch-relevant C++ method signature in a const-constructible
  shape (all fields &'static: name, params: &'static [&'static str], ret,
  is_const, is_static, overrides). It is the exact literal ruff_cpp_codegen::render
  emits, so the generated text now has a real compile target. The &'static shape
  is load-bearing: class_view::FieldRef is String-backed and cannot appear in a
  const; MethodSig is the method-axis sibling that can.
- ClassMethods{classid, methods} + methods_for(registry, classid): the
  registry-entry type + pure zero-fallback lookup (unregistered classid -> empty
  slice). classid is bound OGAR-side, never minted here; the runtime
  classid->methods registry DATA is generated downstream (consumer repo), NOT
  stored here (honest-tester's "defer the runtime registry").
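The zero-fallback lookup described in the second bullet might look roughly like the following. This is a sketch, not the crate's code: MethodSig is trimmed to two fields, the `u32` classid type is an assumption, and `DEMO_REGISTRY` is invented sample data standing in for what the consumer repo would generate.

```rust
/// Trimmed stand-in for MethodSig (only two of the fields listed above).
#[derive(Debug, PartialEq, Eq)]
pub struct MethodSig {
    pub name: &'static str,
    pub params: &'static [&'static str],
}

/// Registry entry: one classid and its generated method slice.
/// The u32 classid type is an assumption of this sketch.
pub struct ClassMethods {
    pub classid: u32,
    pub methods: &'static [MethodSig],
}

/// Pure lookup with a zero fallback: an unregistered classid yields an
/// empty slice, so callers never handle an Option or a panic.
pub fn methods_for(registry: &[ClassMethods], classid: u32) -> &'static [MethodSig] {
    registry
        .iter()
        .find(|entry| entry.classid == classid)
        .map(|entry| entry.methods)
        .unwrap_or(&[])
}

/// Hypothetical generated data; per the commit, the real registry data
/// is produced downstream and is not stored in this crate.
pub const DEMO_REGISTRY: &[ClassMethods] = &[ClassMethods {
    classid: 7,
    methods: &[MethodSig { name: "size", params: &[] }],
}];
```

Taking the registry as a parameter keeps the function free of global state, which matches the commit's point that no runtime registry lives in the contract crate.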

Additive (container-architect ADDITIVE-CONFIRMED): a sibling module, zero
NodeRow/ValueTenant/ValueSchema/stride/ENVELOPE_LAYOUT_VERSION impact. Body-shaping
flags (pure-virtual/constexpr/noexcept/operator/requires) are out of scope.
Board hygiene: LATEST_STATE Contract Inventory updated same commit
(D-CPP-CODEGEN-1). +2 tests (const-constructibility proof + zero-fallback);
640 contract lib green; clippy -D warnings clean.

C-FIRST: D + emitter scaffold + MethodSig EXTEND-CORE all landed; the in-env arc
is complete. Remaining is operator-gated (tesseract-rs wiring + byte-parity).

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…e's Rust side

The deferred Option A content-store tier, built (operator's "keep building here"
+ the leptonica-is-an-install-not-a-transcode epiphany). New unicharset module:
UniCharSet (deepnsm::Vocabulary-shaped: reverse id->unichar + lookup
unichar->id), load_from_str/load_from_file parsing the .unicharset text format
(line 1 = count; first whitespace token per line = unichar; id = position;
property columns ignored — the old_style_included_ plain-table scope),
id_to_unichar/unichar_to_id (the two adapter leaves), and dump() rendering the
<id>\t<unichar> table matching the C++ oracle.
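The parsing rules in this paragraph can be sketched as follows. This is a simplified stand-in, not the module's code: it uses a plain `String` error instead of UniCharSetError, returns `Option` from the two lookups, and ends each dump line with a newline, all of which are assumptions. It also predates the NULL handling described in the final commit.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the UniCharSet shape described above.
#[derive(Debug, Default)]
pub struct UniCharSet {
    reverse: Vec<String>,         // id -> unichar
    lookup: HashMap<String, u32>, // unichar -> id
}

impl UniCharSet {
    /// Line 1 is the entry count. Each following line contributes its first
    /// whitespace-delimited token as the unichar; the id is its position.
    /// Property columns after the first token are ignored.
    pub fn load_from_str(text: &str) -> Result<Self, String> {
        let mut lines = text.lines();
        let declared: usize = lines
            .next()
            .ok_or("empty input")?
            .trim()
            .parse()
            .map_err(|e| format!("bad count: {e}"))?;

        let mut set = UniCharSet::default();
        for line in lines.take(declared) {
            let unichar = line.split_whitespace().next().ok_or("blank entry line")?;
            let id = set.reverse.len() as u32;
            set.reverse.push(unichar.to_string());
            set.lookup.insert(unichar.to_string(), id);
        }
        if set.reverse.len() != declared {
            return Err(format!(
                "declared {declared} entries, found {}",
                set.reverse.len()
            ));
        }
        Ok(set)
    }

    pub fn id_to_unichar(&self, id: u32) -> Option<&str> {
        self.reverse.get(id as usize).map(String::as_str)
    }

    pub fn unichar_to_id(&self, unichar: &str) -> Option<u32> {
        self.lookup.get(unichar).copied()
    }

    /// Renders one "<id>\t<unichar>" line per entry.
    pub fn dump(&self) -> String {
        self.reverse
            .iter()
            .enumerate()
            .map(|(id, u)| format!("{id}\t{u}\n"))
            .collect()
    }
}
```

Keeping both a `Vec` and a `HashMap` trades some memory for constant-time lookups in each direction, which is the Vocabulary-style layout the paragraph names.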

This is the Rust side of PROBE-OGAR-ADAPTER-UNICHARSET. The unicharset path is
pure text parsing — ZERO leptonica (never touches Pix) — so it builds and
unit-tests in-env with no C deps. leptonica is only an *install* (a link dep of
the C++ oracle harness), never a transcode and never in the Rust path. Byte
parity is now one `diff`: combine_tessdata to get eng.unicharset, a ~10-line
libtesseract harness dumps id_to_unichar, `cargo run --example unicharset_dump`
dumps the Rust side, diff. Byte-identical => CONJECTURE -> FINDING.

Additive (sibling content-store module, zero NodeRow/tenant impact). Board:
LATEST_STATE Contract Inventory (D-UNICHARSET-1). +4 tests + the unicharset_dump
example; 644 contract lib green; clippy -D warnings + fmt clean. The
classid->&UniCharSet LazyLock resolver (OGAR wiring) remains the follow-up.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
…ICHARSET FINDING

leptonica installed in-env (apt-get — an install, not a transcode), so the
byte-parity probe RAN and passed. UniCharSet dump vs a C++ UNICHARSET FFI oracle
on the real eng.lstm-unicharset: 112/112 byte-identical.

The falsifier did its job: the documented-format parser matched 111/112; the
oracle named the one real convention it missed — the NULL file-token IS the space
unichar (unicharset.cpp:882 remaps "NULL" -> " "). One-line fix
(load_from_str maps "NULL" -> " "), re-diff, 0 differences. NOT a Core gap.
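The one-line fix can be pictured as a token-mapping step applied before an entry is stored. The helper names below are hypothetical; only the "NULL" to " " remapping itself comes from the commit text.

```rust
/// Maps a .unicharset file token to the unichar it denotes. Tokens are
/// whitespace-delimited, so a bare space cannot appear as a token;
/// "NULL" is the file's spelling of the space unichar.
pub fn token_to_unichar(token: &str) -> &str {
    if token == "NULL" {
        " "
    } else {
        token
    }
}

/// Applies the mapping to the first token of every entry line
/// (the count line is assumed to have been consumed already).
pub fn unichars(body: &str) -> Vec<String> {
    body.lines()
        .filter_map(|line| line.split_whitespace().next())
        .map(|tok| token_to_unichar(tok).to_string())
        .collect()
}
```

A parser that skips this step stores the literal string "NULL" for that entry, which is a single-row mismatch against the oracle dump of the kind the diff reported.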

CONJECTURE -> FINDING for the unicharset adapter: the variable-length bijection
rides the content-store tier with no Core gap and is byte-exact with libtesseract.
Doctrine flipped (core-first-transcode-doctrine.md falsifier RESULT); EPIPHANIES
E-CPP-PARITY-1; plan BYTE-PARITY ACHIEVED. The classid->ClassView->UnifiedStep
dispatch wiring is mechanical remainder; the lookups themselves are now proven.
+1 test (null_token_maps_to_space); contract lib green; clippy + fmt clean.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
@AdaWorldAPI AdaWorldAPI force-pushed the claude/happy-hamilton-0azlw4 branch from fbb7b11 to dce9961 Compare June 17, 2026 20:50
@AdaWorldAPI AdaWorldAPI merged commit 620bd8e into main Jun 17, 2026
5 checks passed
AdaWorldAPI pushed a commit that referenced this pull request Jun 17, 2026
Records the merged #521 (lance-graph-contract C++ codegen target MethodSig +
UniCharSet content store) per the Mandatory Board-Hygiene Rule's post-merge
step. PR_ARC_INVENTORY prepend (Added/Locked/Deferred/Docs/Confidence) +
LATEST_STATE narrative entry + "Recently Shipped PRs" table row.

Captures the PROBE-OGAR-ADAPTER-UNICHARSET FINDING: the full transcode
pipeline (ruff ruff_cpp_spo harvest -> reassemble -> ruff_cpp_codegen -> these
contract types) produces a UniCharSet byte-identical 112/112 to the libtesseract
oracle on real eng data, proving the core-first transcode doctrine end-to-end.
Pairs with ruff #20. Merge commit 620bd8e.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
AdaWorldAPI pushed a commit that referenced this pull request Jun 18, 2026
AdaWorldAPI pushed a commit that referenced this pull request Jun 18, 2026